ABSTRACT


This study examines data from a field experiment investigating
the effects of a personalized recommendation algorithm
that proposes to students which videos to watch next, after
they complete mini-assessments for algebra that are available
on the Math Nation intelligent virtual learning environment
(IVLE). The end users of Math Nation are students enrolled 
in an Algebra 1 course in middle and high schools of the 
state of Florida, and the IVLE is used both during and out 
of school time. The objective of the developed recommenda- 
tion algorithm is to increase student preparation to take the 
state-mandated End-of-Course (EoC) Algebra 1 assessment 
at the end of the school year. The algorithm is based on a 
Markov Decision Process framework that uses as input the 
students’ responses to a series of mini-assessment tests. The 
current study randomly assigned 16,406 students to either
treatment or control conditions, with the assignment blind to
both students and teachers. The results indicate that the effects
of the recommendation algorithm depend on the level of us- 
age of students, showing significant improvements on EoC 
test scores of students who have a moderate level of usage. 
However, there was no effect for low usage students. The 
study also shows that practicing with the mini-assessments
available on Math Nation helps students improve
their performance on the End-of-Course test by a small
margin, irrespective of the usage level. Finally, the study
provides insights on challenges posed for implementing personalized
recommendation algorithms at a large scale, related
both to student self-regulation and teacher orchestration of 
technology use in the classroom. 


1. INTRODUCTION


There is a growing trend in employing intelligent virtual 
learning environments (IVLE) to aid students in improving 
their math performance in K-12 education [8, 26, 23]. While 
there is a robust body of literature that shows that students’ 
preparedness together with various demographic and school 
characteristics are key factors for predicting students’ perfor- 
mance in various math tests [15], IVLE have been viewed as 
an especially promising way of improving students’ achieve- 
ments in mathematics. Given the investment of resources 
into technology products and the time and effort needed to 
integrate them into the curriculum, there has been consid- 
erable interest in determining their effectiveness. A number 
of studies have reported positive effects based both on small 
scale randomized control trials and longer term interventions 
[14, 13, 19, 16, 15], as well as based on observational data 
[12]. There have also been a series of meta-analysis stud- 
ies showing that IVLE have substantial effects on student 
outcomes [10, 11, 24, 27]. 


IVLE have the potential of offering personalized learning experiences.
The latter refer to instruction “in which the pace
of learning and the instructional approach are optimized for the
needs of each learner”, according to the United States National
Education Technology Plan 2017. IVLE that offer
some degree of personalization include Khan Academy at the
K-12 level and Knewton at the higher education level. As discussed
in [3], at the core of personalized learning strategies
is a recommendation algorithm aiming to propose appropri- 
ate learning materials and topics to the student at the right 
time, leveraging the student’s prior history of interactions 
with the IVLE. 


Many personalized learning strategies leverage ideas and
tools from the field of Reinforcement Learning [4, 5, 9, 20].
The key components of a reinforcement learning based algorithm
are the triplet of state, reward and action. The state
reflects information on the student’s knowledge and skill
set on the topic(s) under consideration, the reward relates
to the goals of the strategy (e.g., performance on tests,
engagement with the IVLE, etc.) and the action refers to an
activity (e.g., watch a video on a topic of interest, take
an assessment test, etc.) that, based on the current state
information, aims to maximize the expected reward.


This study reports the results of a large-scale randomized 
field experiment that focuses on the impact of a simple per- 
sonalized strategy implemented on the Math Nation IVLE, 
on a high-stakes, state-mandated End-of-Course (EoC) al- 
gebra test. Although many evaluations of IVLE have been 
published, most of them rely on locally developed standardized
tests, rather than high-stakes statewide tests [10]. Math
Nation is an online video-based tutoring program aiming to
prepare students in the state of Florida for the EoC, which 
is required for high school graduation. The platform offers 
videos on various algebra topics recorded by different tu- 
tors, explaining the main concepts and walking the student 
through related examples, 3-question assessments for each 
topic and 10-question assessments for sets of related topics, 
with video explanations for each question. Therefore, stu- 
dents can assess their progress by taking both the short (3- 
question) and the long (10-question) tests. Further, the plat- 
form offers a monitored discussion area, wherein students 
can pose questions to peers and volunteer tutors. Hence, 
at launch time, it shared a number of characteristics with 
Khan Academy, both being self-guided and easy to use on 
an ad hoc basis, without the need for extensive professional 
development training for teachers. The content of the videos
and assessments is aligned with the curriculum adopted by
the state and also with the content and format of the EoC test.


A new feature of Math Nation is the introduction of an al- 
gorithm to recommend videos to students, leveraging infor- 
mation on their performance on the mini-assessments asso- 
ciated with each video. Specifically, Math Nation divides 
the whole Algebra 1 course materials into 10 sections. Each 
section is further divided into several topics, thus result- 
ing in a total of 93 topics for the entire course. For each 
topic, there is a tutorial video associated with it, recorded 
by different tutors. At the end of the video the student is 
presented with a 3-question assessment (henceforth called a 
mini-assessment) and based on the score obtained, a video 
recommendation (the action) is offered aiming to maximize 
the student’s expected score (the reward) on these mini- 
assessments. The student can follow the recommendation 
or decide to ignore it and select another video of her/his 
own choice by the same or another tutor. To compare the 
effectiveness of the recommendation algorithm, a “business- 
as-usual” competitor is implemented, which recommends the 
next video in a predetermined sequence related to the struc- 
ture of the algebra state curriculum, irrespective of the score 
achieved in the mini-assessment. 


The objectives of the study are twofold: (i) estimate the 
average treatment effect of the recommendation algorithm 
vis-a-vis its competitor together with its interactions with
previous achievement and level of usage of the algorithm,
and (ii) understand the relationship between performance in 
the mini-assessments and the EoC test, after accounting for 
math preparedness and school characteristics of the students 
that participated in this randomized control study. 


The remainder of the paper is structured as follows. Sec- 
tion 2 presents the developed personalized recommendation 
strategy. Section 3 describes in detail the data recorded from 
the algorithm, as well as other covariates used in the anal- 
ysis. Section 4 presents the statistical methods used in the 
analysis and the main results of the study. Finally, Section 
5 discusses the implications of our findings and suggestions 
to modify the recommendation algorithm. 


2. PERSONALIZED RECOMMENDATION 
STRATEGY 


Next, we describe the data-driven algorithm for recommend- 
ing a suitable tutoring video to each individual student. 
As previously mentioned, the content of the course is di- 
vided into 93 topics, with each topic accompanied by a video 
recorded by 5 tutors in English and 1 tutor in Spanish. Stu- 
dents can freely select the tutor for each video. 


To rigorously set the stage for the video recommendation
algorithm, fix a single student, and let $s_k(t)$ be the corresponding
“mini-score” for topic $k \in \{1, 2, \dots, 93\}$, at time
$t = 0, 1, \dots$. These mini-scores, representing the knowledge
level of the student, are obtained by assessing responses to
the mini-assessments, which consist of 4-choice questions with
a single correct choice. Thus, the set of possible outcomes
consists of $i$ correct answer(s), together with $3 - i$ wrong
answer(s), for $i = 0, 1, 2, 3$. Then, we center and normalize the
corresponding scores (henceforth referred to as mini-scores),
so that on average, simply guessing the answers leads to a
zero score. Thus, we have $s_k(t) \in \{-3, 1, 5, 9\}$, and if the
answers are selected completely at random, $E[s_k(t)] = 0$.
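

As a quick check of this centering, the following minimal Python sketch assumes that the mini-score is the affine map $4i - 3$ of the number of correct answers $i$. This map is an inference from the stated set of values, not a formula given in the text, and the function names are illustrative. Under uniform random guessing on three 4-choice questions, the expected mini-score is zero.

```python
from fractions import Fraction
from math import comb

N_QUESTIONS = 3  # questions per mini-assessment (from the text)
N_CHOICES = 4    # answer choices per question (from the text)


def mini_score(n_correct: int) -> int:
    """Centered mini-score; assumed affine map 4*i - 3, giving -3, 1, 5, 9."""
    if not 0 <= n_correct <= N_QUESTIONS:
        raise ValueError("n_correct must be between 0 and 3")
    return N_CHOICES * n_correct - N_QUESTIONS


def expected_score_under_guessing() -> Fraction:
    """Expected mini-score when each answer is chosen uniformly at random."""
    p = Fraction(1, N_CHOICES)  # probability of guessing one question correctly
    return sum(
        comb(N_QUESTIONS, i) * p**i * (1 - p) ** (N_QUESTIONS - i) * mini_score(i)
        for i in range(N_QUESTIONS + 1)
    )


print([mini_score(i) for i in range(4)])
print(expected_score_under_guessing())
```

Exact fractions are used so that the zero expectation is not blurred by floating-point rounding.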


With the above setting, the full state of the student at time
$t$ is given by $S(t) = [s_1(t), \dots, s_{93}(t)]' \in \{-3, 1, 5, 9\}^{93}$,
while $\|S(t)\| = \sum_{k=1}^{93} s_k(t)$ reflects the (total) score of the
student under consideration at time $t$. The dynamical model
for topic $k$ consists of a Markov chain for which the state
is $s_k(t)$. For the time being, suppose that the parameters
of the Markov chain, consisting of $4 \times 4$ tables of transition
probabilities among the states $\{-3, 1, 5, 9\}$, are available. We
will shortly discuss a statistical method leveraging transfer
learning techniques, for estimating the Markov transition
kernels according to the observed data.


The recommendation strategy is to propose to the student 
the tutoring video corresponding to the topic with the largest 
predicted growth in the mini-score. Formally, at time t, the 
IVLE recommends the student to watch the tutoring video 
of topic k*, wherein 


$$k^* = \arg\max_{k} E\left[ s_k(t+1) - s_k(t) \mid S(t) \right],$$


where the notation “ | ” is used to indicate a conditional prob- 
ability distribution. The student can either accept the rec- 
ommendation, watch the video and take the mini-assessment, 
or can ignore the recommendation and select another video 
to watch (by possibly another tutor).


Note that in order to compute the above expected values,
for every topic $k \in \{1, \dots, 93\}$ it suffices to have only the 4
probabilities corresponding to the transition of the Markov
chain from the current state $s_k(t)$ to the next one $s_k(t+1)$.
Intuitively, the difference $s_k(t+1) - s_k(t)$ reflects
the predicted growth of the student in topic $k$. Therefore,
the high level idea of the recommendation strategy is to
propose to the student to work on the topic in which (s)he is
capable of improving her/his knowledge level the most. In this way,
the recommendation aligns with Vygotsky’s theory [28] of
the zone of proximal development by providing a video that is
neither too easy, nor too challenging. Further, the recommended
topic is totally personalized to the student, since
the state $S(t)$ at time $t$ is unique to each student.
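

The selection rule can be illustrated with a short Python sketch. It is not the platform's implementation: the data layout (dictionaries keyed by topic and by current mini-score) and the function names are assumptions made for illustration, and the transition probabilities in the example are made up. Ties are broken at random, as the text specifies.

```python
import random

STATES = (-3, 1, 5, 9)  # possible mini-scores (from the text)


def expected_growth(current, transition_row):
    """E[s_k(t+1) - s_k(t) | s_k(t) = current] for a single topic.

    transition_row holds the 4 probabilities of moving from `current`
    to each state in STATES, in that order.
    """
    return sum(p * (nxt - current) for nxt, p in zip(STATES, transition_row))


def recommend(student_state, kernels, rng=random):
    """Return the topic with the largest expected growth (random tie-break).

    student_state: dict mapping topic -> current mini-score (hypothetical layout)
    kernels: dict mapping topic -> {current mini-score: 4 transition probabilities}
    """
    growth = {
        topic: expected_growth(score, kernels[topic][score])
        for topic, score in student_state.items()
    }
    best = max(growth.values())
    candidates = [topic for topic, g in growth.items() if g == best]
    return rng.choice(candidates)


# Made-up example with two topics.
state = {1: -3, 2: 5}
kernels = {
    1: {-3: [0.5, 0.5, 0.0, 0.0]},    # expected growth 0.5 * 4 = 2
    2: {5: [0.0, 0.25, 0.5, 0.25]},   # expected growth -1 + 1 = 0
}
print(recommend(state, kernels))
```

Only the row of each transition table that matches the student's current mini-score is needed, which is the point made in the paragraph above.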


Finally, we describe the statistical learning procedure for es- 
timating the Markov transition probabilities. For this pur- 
pose, the students are clustered in 12 different groups, based 
on their demographic and other background data, so that 
students of similar learning abilities will be assigned to the 
same group (cluster). The details of the clustering proce- 
dure are provided in Section 3. We assume that students in 
each group share the Markov transition probabilities reflect- 
ing their cognitive responses to watching the tutoring video 
of a specific topic. Thus, in order to estimate the transition
probabilities for students in a fixed group, we divide the
total number of transitions between every pair of the possible
states $\{-3, 1, 5, 9\}$ in the group by the total number
of transitions in the group. We emphasize the following
points. First, while the Markov transition probabilities are 
the same for all students in one demographic/background 
group, the states are uniquely personalized to each student. 
Second, the estimates of the transition probabilities change 
over time as the platform collects more data from the re- 
sponses of the students to the mini-assessments. Further, 
when Math Nation starts being used by the students, the 
initial estimates of the transition probabilities are selected 
randomly, and are updated throughout the academic year 
as the students continue to use it. Finally, if there is more 
than one k* maximizing the predicted growth, one will be 
selected at random. 
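

A count-based estimate of this kind can be sketched in Python as follows. The sketch makes two assumptions that go beyond the text: counts are normalized per starting state, so that each row of the table sums to one, and a row with no observations falls back to a uniform distribution (the text states only that initial estimates are selected randomly). The function name and input format are illustrative.

```python
from collections import Counter

STATES = (-3, 1, 5, 9)  # possible mini-scores (from the text)


def estimate_kernel(transitions):
    """Estimate a 4x4 transition table for one topic and one student group.

    transitions: iterable of (previous mini-score, next mini-score) pairs
    pooled over all students in the group.
    Assumption: counts are normalized within each starting state.
    """
    counts = Counter(transitions)
    kernel = {}
    for prev in STATES:
        row_total = sum(counts[(prev, nxt)] for nxt in STATES)
        if row_total == 0:
            # Placeholder for unobserved starting states (assumption).
            kernel[prev] = [1 / len(STATES)] * len(STATES)
        else:
            kernel[prev] = [counts[(prev, nxt)] / row_total for nxt in STATES]
    return kernel


# Made-up observations for one group.
observed = [(-3, 1), (-3, 1), (-3, 5), (1, 9)]
for prev, row in estimate_kernel(observed).items():
    print(prev, row)
```

Re-running the estimate as new responses arrive mirrors the statement that the estimates are updated throughout the academic year.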


Before the algorithm was deployed within the Math Nation platform,
it was extensively tested on synthetic data generated
based on data collected in previous years from the platform.
Specifically, students who had used the platform in previous
years were clustered in 12 groups (see also Section 3) based 
on their demographic and background information. Note 
that the distributions of such data are very similar to those 
in the academic year that the recommendation algorithm 
was launched and evaluated in the current study. Subse- 
quently, the response data to the mini-assessment tests of 
the students within each cluster were used to estimate the 
corresponding Markov transition probabilities. The latter 
were then used to initialize the recommendation algorithm 
and to generate synthetic data for students in different clus- 
ters. The upshot of this analysis was that the algorithm 
required adequate engagement (t > 45) to show significant 
improvement in performance in the mini-assessments. We 
revisit this point in the Discussion section. 


Table 1: Distribution of the students across different Math
Achievement Levels, School Grades and Student Grades


Achievement Level | No. of Students | School Grade | No. of Students | Student Grade | No. of Students
1                 | 473             | A            | 4,377           | 5             | 3
2                 | 1,453           | B            | 2,001           | 6             | 1,463
3                 | 3,487           | C            | 4,580           | 7             | 3,599
4                 | 2,711           |              |                 | 8             | 5,893
5                 | 2,834           |              |                 |               |
Total             | 10,958          | Total        | 10,958          | Total         | 10,958


Note: Data based on previous school year performance


Figure 1: Distribution of Pre-Score


3. DATA DESCRIPTION 


In this study, we randomly assigned a sample of 16,406 middle
and high school students enrolled in Algebra 1 in a large 
school district in the state of Florida, to a treatment (pro- 
posed recommendation strategy) or a control (business as 
usual recommendation strategy) group. The assignment was 
blind to students and teachers. The treatment group re- 
ceived video recommendations as described in the previous 
section, while the control group received a recommendation 
to watch the next video in the curriculum sequence. To 
initialize the recommendation, a randomized cluster design 
was employed. Specifically, students were first matched ac- 
cording to their grade, school characteristics and math pre- 
paredness test scores from the previous school year and then 
randomly assigned to the two groups. The variables used
for matching purposes were the scores on the state standardized
mathematics test, called the Mathematics Florida
Standards Assessment¹ (henceforth referred to as Pre-Score,
with the corresponding test referred to as Pre-Test), as well
as an achievement level assigned to them by their schools,
while the quality of each school is reflected by a grade assigned
to it by the state Department of Education². The
latter grades are based on several components and have five
different levels (‘A’ being the highest level and ‘F’ being the
lowest one). Due to lack of data for many of these variables,
5,448 students were removed from any further analysis.
Hence, Table 1, which shows the distributions of the students
across different Achievement Levels, School Grades and Student
Grades, and Figure 1, which depicts the distribution of the
Pre-Score, are based on the remaining 10,958 students.


¹http://www.fldoe.org/accountability/assessments/k-12-student-assessment/fsa.stml

²http://www.fldoe.org/accountability/accountability-reporting/school-grades/


Figure 2: Boxplots of clusters hierarchically ordered based
on Pre-score: For each cluster, School Grade, Achievement 
Level and Student Grade and Cluster Size are reported. 


Using the above information, students were assigned to clus- 
ters/groups. This cluster assignment is used as a categor- 
ical variable in the analysis presented in Section 4. The 
clusters are designed in such a way that each of them cor- 
responds to a group with a unique combination of math 
preparedness and school grade. In summary, the following 
four variables were considered by the clustering algorithm: 
Pre-Score, Math Achievement Level, School Grade and Stu- 
dent Grade. An agglomerative hierarchical clustering algo- 
rithm was employed for this task and using the dendrogram 
with Gower’s distance metric, along with silhouette values 
[7], the number of clusters was chosen to be 12. Figure 2 
provides a pictorial representation of the key features of the 
clusters. Specifically, for each cluster the Figure depicts the 
boxplot of the Pre-Score and also the corresponding Student 
Grade, Math Achievement Level, School Grade and Size of 
the cluster. For ease of comparison, the clusters are ordered 
according to the distribution of the Pre-Score. Hence, clus- 
ter 1 corresponds to the group of students having the lowest 
Pre-Score, while cluster 12 is the group with the highest Pre- 
Score. As Figure 2 shows, the size of cluster 2 was very small 
and hence it was merged with cluster 1 for the subsequent 
analyses. 
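

To make the distance used by the clustering step concrete, the following Python sketch computes Gower's distance between two student records. It assumes that Pre-Score is treated as numeric (absolute difference scaled by its range) and that the other three variables are treated as categorical (0 if equal, 1 otherwise). The text does not state how each variable was typed, so this is only one possible setup, and the field names and values are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower's distance between two records given as dicts.

    numeric_ranges: dict mapping each numeric feature to its range
    (max - min) over the data set; all other features are compared
    as categorical.
    """
    total = 0.0
    for feature in a:
        if feature in numeric_ranges:
            span = numeric_ranges[feature]
            total += abs(a[feature] - b[feature]) / span if span else 0.0
        else:
            total += 0.0 if a[feature] == b[feature] else 1.0
    return total / len(a)


# Hypothetical records; the Pre-Score range of 100 is made up.
s1 = {"pre_score": 300, "achievement_level": 2, "school_grade": "B", "student_grade": 8}
s2 = {"pre_score": 340, "achievement_level": 3, "school_grade": "B", "student_grade": 8}
print(gower_distance(s1, s2, {"pre_score": 100}))
```

A matrix of such pairwise distances is the usual input to an agglomerative hierarchical clustering routine, from which the dendrogram and silhouette values mentioned above can be obtained.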


The number of times a particular student takes the mini-assessment
after watching a video is defined as the usage by
that student. Figure 3 depicts the average usage per student
for each of the clusters for both the control and treatment 
groups. It can readily be seen that the overall average across 
the study population is 2.88, with many clusters exhibiting 
significantly lower usage. There are also a few clusters ex- 
hibiting high usage; e.g. cluster 5 for the control group and 
cluster 9 for the treatment group. 


4. METHODS AND RESULTS 


[Figure 3 plot: per-student usage across clusters; the horizontal line marks the overall average of 2.88]


Figure 3: Average usage per student for different clusters,
for both treatment and control groups


The analyses described below aim to provide answers to
the two objectives outlined in Section 1. In our first analysis,
we estimate the average treatment effect of the recommendation
algorithm on EoC scores, using a simple linear
regression model with the following two categorical variables
and the interaction between them: (i) the first categorical
variable, TC, comprises two levels: the first represents
the Treatment group that watched the personalized
recommended videos and took the corresponding mini-assessments,
and the second level corresponds to the Control
group; (ii) the second categorical variable, Previous Achievement
Level, comprises five categories, each corresponding
to a different level of achievement in the Pre-Test. Level 1
stands for the lowest achievement, whereas the highest level
is coded by level 5. Then, the linear regression model with
the above two predictors and their interaction is given by:


$$y = \mu + \beta_1\, \mathrm{TC} + \beta_2\, \mathrm{Achievement} + \beta_3\, (\mathrm{TC} \times \mathrm{Achievement}) + \epsilon \quad (1)$$


where $y$ represents the EoC score and we further assume
that $\epsilon \sim N(0, \sigma)$. Based on this model, the estimate of
the average treatment effect of the personalized recommendation
on EoC score is the coefficient $\beta_1$ corresponding to
the variable TC. Further, estimates of standard errors of the
regression coefficients are based on cluster-robust estimators
[2]. To answer the first research question discussed in Section 1,
we test $H_0^{TC}: \beta_1 = 0$ vs. $H_1^{TC}: \beta_1 \neq 0$. The coefficient
$\beta_1$ is the difference between the mean EoC score of the
Treatment and the Control group, after accounting for the
effect of all the other covariates. The estimated coefficients
(scaled) and corresponding p-values are reported in Table 2.
Table 2 shows that the achievement levels are statistically
significant, while the treatment effect (i.e., the impact of the
developed recommendation algorithm) is not. Further, there
is a small positive significant effect for the interaction of the
treatment with Achievement level 2. However, as shown in
Figure 3, usage patterns vary widely across different groups
(clusters) of students.


To that end, and in order to gain a deeper understanding 
of how the average treatment effect behaves across different 
IVLE usage levels, we fit model (1) separately on groups of 
students exhibiting different usage levels. After some initial
exploratory analysis, we divided the students into approximately
evenly distributed usage groups as shown in Table
3. The results are summarized in Table 3, whose first column
specifies the usage level of the group. As an example,
students who have taken at least 10 mini-assessment tests
are categorized as a group with usage level 10 or higher.
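

One reading of this grouping, consistent with the "or higher" wording and with the decreasing sample sizes in Table 3, is that each usage level defines a nested subset of students. The Python sketch below illustrates that reading only; the function name, student identifiers and usage counts are hypothetical.

```python
def usage_group(usage_by_student, threshold):
    """Return the IDs of students who took at least `threshold` mini-assessments."""
    return sorted(sid for sid, n in usage_by_student.items() if n >= threshold)


# Hypothetical usage counts (number of mini-assessments taken per student).
usage = {"s1": 4, "s2": 10, "s3": 27, "s4": 52}
for level in (9, 27, 48):
    print(level, usage_group(usage, level))
```

Under this reading, model (1) is refit on each subset, so higher usage levels necessarily rest on fewer students.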


Table 2: Estimated coefficients and corresponding p-values
for Model (1)


Variable               | Scaled Coefficient | p-value
Intercept              | 348.45             | <0.001
Treatment              | -0.99              | 0.32
Achievement level 2    | 11.28              | <0.001
Achievement level 3    | 23.14              | <0.001
Achievement level 4    | 35.31              | <0.001
Achievement level 5    | 50.02              | <0.001
TC*Achievement level 2 | 1.94               | 0.05
TC*Achievement level 3 | 0.94               | 0.34
TC*Achievement level 4 | 1.25               | 0.21
TC*Achievement level 5 | 1.09               | 0.27

Note: The scaled coefficients are obtained by dividing the
estimated coefficients by their standard error.
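

The link between a scaled coefficient and its p-value can be illustrated with a small Python sketch. It assumes a standard normal reference distribution, which is a large-sample approximation chosen here for illustration. The reported p-values rest on cluster-robust standard errors and may use a different reference distribution, so small discrepancies are expected.

```python
from math import erfc, sqrt


def two_sided_p_value(scaled_coefficient: float) -> float:
    """Two-sided p-value P(|Z| > |z|) for a standard normal Z (assumption)."""
    return erfc(abs(scaled_coefficient) / sqrt(2))


# Scaled coefficients taken from Table 2.
for name, z in [("Treatment", -0.99), ("TC*Achievement level 2", 1.94)]:
    print(name, round(two_sided_p_value(z), 3))
```

For the Treatment row the approximation gives a value close to the reported 0.32, and for the interaction with Achievement level 2 a value close to the reported 0.05.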


The second and third columns of Table 3 contain the p-values
corresponding to the test $\beta_1 = 0$ and the scaled version of the
estimated coefficients, respectively. As is evident from
Table 3, the first few rows, which correspond to lower usage
groups, have high p-values and thus the average treatment
effect is not statistically significant. The treatment effect becomes
significant for students who used the platform more
extensively (usage level of 48 or higher).


The model also controls for the level of achievement of students.
Table 4 presents the results for the $\beta_2$ regression coefficient
for different usage levels. The corresponding p-values
are given in parentheses. It can be seen that the effect is statistically
significant (marked in bold font) across almost all
Previous Achievement levels and usage levels, as expected
based on the overall results presented in Table 2. This
result is in accordance with a large body of literature
that has found a positive association between level of math
preparation and test scores (see, e.g., [16, 15] and references
therein). Further, the magnitude of the coefficient is larger
for higher achievement levels.


Model (1) also estimates the interaction effect between the 
treatment and the Previous Achievement level. Table 5 
summarizes the scaled estimates of the interaction effects 
and the p-values (given in parentheses). Since the Previous 
Achievement level has 5 categories, we obtain the estimates 
for all the levels except the baseline category, i.e., Previous 
Achievement level 1, which is absorbed in the intercept of 
the model. As usage increases, Table 5 displays more signifi- 
cant interaction effects (in bold font) between treatment and 
achievement level as compared to low usage groups. Note
that due to lack of data in selected categories, some of the
interaction effects could not be estimated and are hence left blank.


Note that most of the interaction effects are not statistically 
significant. There are selected ones with a positive coef- 
ficient, corresponding to higher achievement levels (3 and 
above) for high usage groups (e.g., 33 and 65). Analogously, 
there are selected interaction effects with a negative coefficient
corresponding to the lower achievement level 2 and
relatively high usage levels.


To answer the second research question on the relationship 
between the performance of the students in mini-assessments 
and in the EoC test, we obtain the Average Mini-Assessments 


Table 3: Usage-wise effect of the recommendation: p-values
and scaled coefficients for different usage levels


Usage Level | p-value | Scaled Coefficient | Sample size
9           | 0.64    | 0.46               | 1097
13          | 0.40    | 0.85               | 932
27          | 0.29    | 1.07               | 515
33          | 0.17    | 1.39               | 411
48          | 0.02    | 2.41               | 254
52          | 0.01    | 2.56               | 230
55          | 0.05    | 1.93               | 207
59          | 0.06    | 1.91               | 183
65          | 0.02    | 2.31               | 140
74          | 0.08    | 1.78               | 92


Table 4: Usage-wise effect of the Previous Achievement
level: p-values and scaled coefficients for different usage levels


Usage Level | Level 2       | Level 3       | Level 4        | Level 5
9           | 4.65 (<0.001) | TAI (<0.001)  | 10.67 (<0.001) | 15.76 (<0.001)
13          | 94 (<0.001)   | 7.5 (<0.001)  | 10.74 (<0.001) | 15.37 (<0.001)
27          | 1.87 (0.06)   | 2.64 (0.008)  | 4.61 (<0.001)  | 6.76 (<0.001)
33          | 2.45 (0.01)   | 3.02 (0.002)  | 4.72 (<0.001)  | 6.19 (<0.001)
48          | 3.23 (0.001)  | 3.19 (0.001)  | 4.18 (<0.001)  | 3.41 (<0.001)
52          | 3.16 (0.002)  | 3.29 (0.001)  | 2.16 (<0.001)  | 0.42 (<0.001)
55          | 2.97 (0.003)  | 3.07 (0.002)  | 3.97 (<0.001)  | 5.41 (<0.001)
59          | 2.05 (0.04)   | 1.92 (0.05)   | 2.65 (0.008)   | 3.58 (<0.001)
65          | -             | 0.13 (0.89)   | 1.83 (0.06)    | 4.15 (<0.001)
74          | -             | 0.50 (0.62)   | 1.34 (0.18)    | 2.70 (0.008)


Table 5: Usage-wise interaction effect of treatment and Previous
Achievement level: scaled coefficients (p-values) for
different usage levels


Usage Level | TC * Level 2  | TC * Level 3 | TC * Level 4 | TC * Level 5
9           | “UAT (0.64)   | -0.20 (0.84) | “0-0 (0.69)  | “0-06 (0.94)
13          | -1.08 (0.27)  | -0.68 (0.49) | 0.81 (0.42)  | -0.34 (0.74)
27          | 0.62 (0.54)   | 1.19 (0.23)  | 1.05 (0.29)  | 1.60 (0.11)
33          | 0.78 (0.43)   | AT (0.14)    | 1.32 (0.19)  | Pa (0.03)
48          | ~ZBI (0.005)  | -1.21 (0.22) | -2.05 (0.04) | -
52          | “IBS (0.004)  | ~Ta7 (0.14)  | “ZS (0.03)   | -
55          | ~ZAg (0.01)   | -LI8 (0.24)  | -1.73 (0.08) | -
59          | ~ZA0 (0.02)   | -0.71 (0.48) | -1.68 (0.09) | -
65          | -             | ZG (0.01)    | ZL (0.04)    | TBD (0.005)
74          | -             | -0.98 (0.32) | -1.28 (0.20) | -
Score for each of the ~ 11,000 registered students, wherein
the average is computed over the mini-scores for all the mini- 
assessments the student has completed. 


Then, the following Analysis of Covariance model is fitted to 
the data. To control for the students’ math preparedness and
school characteristics, we include the cluster information as 
a factor in the model. 


$$y_{ij} = \mu + \alpha_i + \beta x_{ij} + \epsilon_{ij}, \quad (2)$$


wherein $y_{ij}$ is the EoC score and $x_{ij}$ is the Average Mini-Assessments
Score for the $j$-th student in the $i$-th cluster. Further,
$\mu$ is the overall mean effect and $\alpha_i$ is the additional effect
due to the assignment of the student to the $i$-th cluster
that accounts for prior math knowledge, grade and school
characteristics of the students.


Table 6 depicts the estimated regression coefficients, their 
standard errors, together with the value of the test statis- 
tic and the p-value corresponding to the significance test 
for each of the coefficients. All p-values are well below
the nominal 0.05 (or 0.01) level, thus indicating
that each predictor has a significant effect
on the EoC test score. The estimated coefficient for the
Average Mini-Assessments Score is 1.15 (the corresponding scaled
regression coefficient is 7.46). This small but statistically significant
coefficient indicates that an increase of one point in the
average student score on the mini-assessments corresponds
to an expected improvement of 1.15 points in the EoC score.
At first glance, this relationship between the average mini- 
score performance and the EoC test seems of limited practi- 
cal significance. However, when examining the distribution 
of EoC scores across all students (~ 90,000) that used the 
Math Nation platform at some point in time (not necessar- 
ily participants in the current study), we find that about 
1.9% are within 1 point of the passing threshold. Hence, in 
light of this information, it is reasonable to posit that the 
recommendation algorithm would have been beneficial for a 
good number of students, if it were adopted and used by all 
platform participants. 
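

To illustrate how the fitted Model (2) is read, the following Python sketch plugs the estimates reported in Table 6 into the model equation. It assumes that the merged cluster 1/2, which has no row in Table 6, is the baseline absorbed in the intercept; that is an inference from the table, and the function name is illustrative.

```python
INTERCEPT = 464.29   # Table 6
SLOPE = 1.15         # coefficient of the Average Mini-Assessments Score (Table 6)
CLUSTER_EFFECT = {   # cluster effects relative to the baseline (Table 6)
    3: 29.79, 4: 29.61, 5: 37.06, 6: 38.99, 7: 36.84,
    8: 49.74, 9: 53.47, 10: 73.48, 11: 68.89, 12: 75.34,
}


def predicted_eoc(cluster: int, avg_mini_score: float) -> float:
    """Fitted EoC score under Model (2) for a given cluster and average mini-score."""
    if cluster in (1, 2):
        effect = 0.0  # assumed baseline: merged cluster 1/2
    else:
        effect = CLUSTER_EFFECT[cluster]
    return INTERCEPT + effect + SLOPE * avg_mini_score


# A cluster-10 student with an average mini-score of 5.
print(round(predicted_eoc(10, 5), 2))
# Effect of one extra point in the average mini-score, holding the cluster fixed.
print(round(predicted_eoc(5, 6) - predicted_eoc(5, 5), 2))
```

Within a cluster, the fitted score moves by 1.15 points per point of average mini-score, which is the quantity discussed above.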


Table 6: Results of the Analysis of Covariance model: Re- 
sponse EoC Score; categorical predictor cluster and numer- 
ical predictor Average Mini-Assessments Score 


Coefficients          | Estimate | Std. Error | t-value | p-value
Intercept             | 464.29   | 2.61       | 178.18  | <2e-16
Cluster 3             | 29.79    | 3.63       | 8.21    | 3.9e-16
Cluster 4             | 29.61    | 2.77       | 10.70   | <2e-16
Cluster 5             | 37.06    | 2.92       | 12.70   | <2e-16
Cluster 6             | 38.99    | 3.13       | 12.46   | <2e-16
Cluster 7             | 36.84    | 2.77       | 13.29   | <2e-16
Cluster 8             | 49.74    | 2.92       | 17.02   | <2e-16
Cluster 9             | 53.47    | 2.87       | 18.62   | <2e-16
Cluster 10            | 73.48    | 2.90       | 25.33   | <2e-16
Cluster 11            | 68.89    | 3.21       | 21.49   | <2e-16
Cluster 12            | 75.34    | 2.88       | 26.13   | <2e-16
Avg. Mini-Assessments | 1.15     | 0.15       | 7.46    | 1.3e-13


5. DISCUSSION


The analysis of the data from the randomized control study
provides a number of useful insights in designing recommendation
strategies for IVLE. Firstly, the recommendation algorithm
holds a lot of promise, but as is well known in reinforcement
learning, it requires an adequate amount of usage to
“explore” various possibilities in order to maximize expected
reward. The adequate usage requirement is also discussed
in the literature evaluating recommendation strategies for
Massive Open Online Courses; see [6, 17, 18] and references
therein. As mentioned in Section 2, an initial evaluation of
the proposed algorithm during its development phase based
on synthetic data indicated that it starts yielding satisfactory
results, in terms of students improving their performance
on the mini-assessments, once students follow its recommendations
more than 45 times. The results of the analysis
in Section 4 are in line with the aforementioned finding. As
Table 3 indicates, the recommendation strategy shows significant
impact starting from a usage level of 48. Further,
note that in our study the primary outcome under consideration
is the EoC test that takes place at the end of the
academic year, as opposed to a more direct outcome related
to the recommendation algorithm, such as performance over
time on the mini-assessment tests. In many studies in the
literature (e.g., [1, 22]), assessment of a recommendation algorithm
was based on more immediate outcomes (e.g., the
mini-assessments in our setting), as opposed to a more distal
outcome, such as the EoC. Nevertheless, the results of our
experiment indicate that with stronger student engagement
the developed algorithm could be more widely beneficial.


To address the issue of low usage, a new experiment has 
been designed, wherein the teachers are directly involved 
in the implementation of the recommendation system in the 
classroom, which is expected to yield higher levels of engage- 
ment of students with the IVLE platform. This experiment 
is under way at the time of this publication. 


It is also worth mentioning that our first analysis was of 
“Intent-to-Treat” type, because it evaluated the effect of being
randomly assigned to treatment or control groups without
consideration of the extent to which students used the recommendation
strategy. In contrast, traditional Complier
Average Causal Effect analysis [21, 25] is based on the
“Treatment-on-the-Treated” principle, wherein one estimates
the treatment effect for those who complied with the treatment.
The latter constitutes a direction of future research.


Another issue of broader interest is that many IVLE recommendation
algorithms are designed to assign test problems
in an adaptive way, as opposed to assigning videos as Math
Nation does. However, in the modified implementation of
the algorithm currently under evaluation, the student can
skip watching the recommended video and take the mini-assessment
directly; in case (s)he answers fewer than two of the
questions correctly, the algorithm recommends watching the
segment of the video that covers the corresponding material
and then retaking the mini-assessment. This modification
aims to enhance the emphasis of the recommendation algorithm
on solving problems, but at the same time enable
students to review material relevant to questions that they
answered incorrectly.